Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation

Abstract Recognizing cyanobacteria promoters on a genome-wide scale is a vital step, and computational methods are a promising way to assist the difficult biological identification. When building recognition models, these methods rely on non-promoter generation to cope with the lack of real non-promoters. Nevertheless, the factitious, significant difference between promoters and generated non-promoters causes over-optimistic predictions. Moreover, being designed for E. coli or B. subtilis, existing methods cannot uncover novel, distinct motifs among cyanobacterial promoters. To address these issues, this work first proposes a novel non-promoter generation strategy called phantom sampling, which eliminates the factitious difference between promoters and generated non-promoters. Furthermore, it elaborates a novel promoter prediction model based on the Siamese network (SiamProm), which amplifies the hidden difference between promoters and non-promoters through a joint characterization of global associations, upstream and downstream contexts, and neighboring associations w.r.t. k-mer tokens. The comparison with state-of-the-art methods demonstrates the superiority of our phantom sampling and SiamProm. Comprehensive ablation studies and feature space illustrations also validate the effectiveness of the Siamese network and its components. More importantly, SiamProm, upon our phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC'), which is palindrome-patterned, content-conserved, but position-shifted.


INTRODUCTION
As a kind of prokaryotic bacteria, cyanobacteria can not only produce oxygen by plant-like photosynthesis but also work as green microbial cell factories for valuable natural products, chemicals and biofuels [1]. During these biological processes of cyanobacteria, promoters near transcription start sites (TSS) play an important role (i.e. recognizing transcription units) in functional gene transcription [2]. Due to the specific binding of promoters to RNA polymerases, the identification of promoters helps us understand how a gene is expressed [3]. However, since promoter recognition by biological assays is usually expensive and time-consuming, it is difficult to recognize promoters on a genome-wide scale.
Owing to the wealth of promoter sequences determined by high-throughput sequencing technologies over the past two decades [4], computational methods have become a promising new approach to predicting potential cyanobacteria promoters [5].
Existing methods can be roughly classified into three categories as follows. Scoring function-based methods exploit diverse promoter features (e.g. transcription binding sites, conserved sequence motifs) to construct a weight matrix or score matrix to rank newly given sequences [6][7][8]. Due to the weak characterization of common sequence features, these methods cannot achieve satisfactory prediction.
Machine learning-based (ML-based) methods are capable of recognizing potential sequences in a data-driven manner. They first represent promoter sequences as vectors based on various physicochemical properties and string features. Then, they explore traditional classifiers [e.g. support vector machines (SVM)] to discriminate whether a sequence is a promoter [9][10][11]. These methods effectively solve the weak characterization of sequence features. However, their performance depends on the quality of hand-crafted features, which hardly capture promoter contextual features.
Although current ML-based and DL-based promoter prediction methods have achieved inspiring results, some issues remain to be addressed. First, the negative generation strategies in existing methods result in over-optimistic predictions, necessitating more effective approaches. Because the common motif (e.g. the −10 region Pribnow Box) is absent a priori in the generated non-promoters, the difference between them and promoters is artificially pronounced, which leads to over-optimistic prediction. In contrast, as observed in the literature [18], experimentally determined non-promoters are very similar to promoters, and existing negative sampling strategies cannot handle them well. Second, most existing computational methods for predicting promoters focus on E. coli and B. subtilis, and few are designed for cyanobacterial promoters. In addition, compared with both E. coli and B. subtilis, cyanobacteria possess nitrogen metabolism without σN-type factors (such as σ54), which implies unobserved sequence motifs in cyanobacterial promoters [19]. Existing prediction methods cannot dig out such novel potential motifs in cyanobacteria.
To address the disadvantages of standard negative sampling methods, this work proposes a novel negative sampling strategy, called phantom sampling, to generate non-promoters that are more suitable for cyanobacteria prediction tasks, improving model performance and generalization. Furthermore, considering the particularity of cyanobacterial promoter structure, this work elaborates a novel promoter prediction model based on the Siamese network (SiamProm), which amplifies the sequence difference between promoters and non-promoters so that they can be distinguished more easily. Overall, the main contributions of this work are as follows.
(i) We propose a negative sampling method that is more suitable for promoter prediction tasks. It generates more realistic non-promoters, whose GC contents and Pribnow boxes look more like those of promoters. Such non-promoters enable prediction models to capture novel sequence features or potential motifs of cyanobacterial promoters.
(ii) SiamProm provides a novel multi-view feature representation method that maximizes the distinction between cyanobacterial promoters and non-promoters. Specifically, its representation learning module captures the globally important associations between pairwise k-mer tokens, the upstream and downstream contexts of sequences, and the associations between neighboring tokens.
(iii) SiamProm, upon our phantom sampling, helps find a novel motif ('GCGATCGC') among cyanobacterial promoters. This motif is palindrome-patterned, content-conserved, but position-shifted.

Promoter collection
We collected 13 705 experimentally validated promoter sequences of cyanobacteria (i.e. Nostoc sp. PCC 7120) from [20]. According to the TSS, each promoter sequence covers the core fragment from 60 bp upstream to 20 bp downstream. Moreover, since the original dataset may contain redundancy or noise [21], we applied CD-HIT-EST [22] with a cut-off value of 0.8 to exclude highly similar promoter sequences. Finally, the dataset contains 12 566 non-redundant promoters, taken as positive samples when building prediction models.
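As a rough illustration of this collection step, the sketch below extracts the −60 to +20 window around each TSS and then removes sequences with ≥80% identity using CD-HIT-EST. The genome file name, the coordinate conventions and the reverse-strand handling are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (assumed file names and TSS conventions, not the authors' exact pipeline):
# extract the -60..+20 window around each TSS, then cluster at 80% identity with CD-HIT-EST.
import subprocess
from Bio import SeqIO

genome = {rec.id: str(rec.seq) for rec in SeqIO.parse("nostoc_pcc7120.fasta", "fasta")}

def promoter_window(chrom, tss, strand, up=60, down=20):
    """Return the 81-bp window from `up` bp upstream to `down` bp downstream of a 0-based TSS."""
    seq = genome[chrom]
    if strand == "+":
        frag = seq[tss - up: tss + down + 1]
    else:  # reverse strand: take the mirror window and reverse-complement it
        frag = seq[tss - down: tss + up + 1]
        comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
        frag = "".join(comp.get(b, "N") for b in reversed(frag))
    return frag

# After writing the windows to promoters_raw.fasta, remove >=80%-identical sequences.
# (-n 5 is the recommended word size for a 0.80 identity cut-off.)
subprocess.run(["cd-hit-est", "-i", "promoters_raw.fasta",
                "-o", "promoters_nr.fasta", "-c", "0.8", "-n", "5"], check=True)
```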

Non-promoter construction
Building a prediction model requires positive samples (i.e. promoters) and negative samples (i.e. non-promoters). However, only a few annotated non-promoters are available, so non-promoter generation strategies are adopted. The most popular strategy is sampling sequence fragments from CDS regions (CDS sampling) [5, 13, 23, 24]. In addition, some methods employ random generation [14] or partial substitution [12]. However, the generated non-promoters do not contain the significant common motifs (e.g. the −10 Pribnow Box), resulting in a significant difference between promoters and generated non-promoters. In other words, it is easy to distinguish promoters from such non-promoters even with a simple model. For example, an SVM can achieve good prediction (88.54% accuracy) based on k-mer and GC content features under the CDS sampling strategy, while a DL-based method, DeePromoter [12], achieves 90.93% accuracy (Table 2). Thus, such a factitious difference between promoters and non-promoters causes over-optimistic predictions.
More importantly, as observed in the literature [18], experimentally determined non-promoters are very similar to promoters regarding the conserved motif and GC content (Figure 3F and K). When models trained under these sampling strategies meet real non-promoters, their performance drops dramatically and is not robust. For example, the SVM achieves only 20.59% accuracy, while DeePromoter achieves 26.47% accuracy (Table 3).
To cope with this issue, we designed a new non-promoter generation strategy called phantom sampling, which generates a corresponding non-promoter for each promoter. The core idea is that a generated non-promoter keeps a Pribnow Box, TSS and GC content similar to those of the promoter but differs in the bases of the other regions. Figure 1 illustrates the steps of the phantom sampling procedure. First, for each promoter (−60 to +20), both the Pribnow Box (−12 to −7) and the TSS (−1 to 0) are fixed. Then, the remaining fragments are concatenated into an interim sequence and divided into 10 parts of equal length. After that, 7 of the parts are randomized under the constraint that the average GC content differs by <5% from that of the corresponding promoter. Finally, a non-promoter is constructed by returning these parts to their original positions in the promoter and merging them with the Pribnow Box and the TSS. The one-against-one phantom sampling pushes the prediction model to find hidden novel motifs different from the well-known motifs (e.g. the Pribnow Box). Furthermore, it maintains an equal distribution of positive and negative samples during training.
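The following is a minimal Python sketch of the phantom sampling steps described above. The index conventions, the base-wise randomization of the selected parts and the retry loop for the 5% GC tolerance are our reading of the description, not the authors' released code.

```python
# Minimal sketch of phantom sampling (index conventions and the random-base
# replacement are assumptions based on the description above).
import random

BASES = "ACGT"

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def phantom_sample(promoter, gc_tol=0.05, n_parts=10, n_shuffled=7, max_tries=100):
    """Generate one non-promoter for an 81-bp promoter spanning -60..+20 around the TSS."""
    # 0-based positions of the fixed regions within the 81-bp window (-60..+20):
    pribnow = range(48, 54)   # -12..-7
    tss = range(59, 61)       # -1..0
    fixed = set(pribnow) | set(tss)
    free_idx = [i for i in range(len(promoter)) if i not in fixed]

    # split the non-fixed positions into ~10 equal parts
    size = len(free_idx) // n_parts
    parts = [free_idx[i * size:(i + 1) * size] for i in range(n_parts - 1)]
    parts.append(free_idx[(n_parts - 1) * size:])

    target_gc = gc_content(promoter)
    for _ in range(max_tries):
        seq = list(promoter)
        for part in random.sample(parts, n_shuffled):
            for i in part:
                seq[i] = random.choice(BASES)   # randomize bases outside the fixed motifs
        candidate = "".join(seq)
        if abs(gc_content(candidate) - target_gc) < gc_tol:   # keep GC within 5% of the promoter
            return candidate
    return candidate  # fall back to the last draw if the tolerance is never met
```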

Datasets
Random generation, CDS sampling, partial substitution and phantom sampling were applied to generate four groups of non-promoters, respectively (Table 1). The number of non-promoters in each negative group is the same as that of promoters, and each sequence has a length of 81 base pairs. As suggested in [25], we constructed four datasets (DB1, DB2, DB3 and DB4) by combining promoters with each group of non-promoters accordingly. They are used to evaluate the prediction model under 10-fold cross-validation. In addition, we calculated the average GC content of non-promoters in each dataset [26]. We also generated multiple groups of non-promoters based on phantom sampling to evaluate the robustness of SiamProm; the details are in Supplementary Table S2. Moreover, although these datasets contain generated non-promoters, we collected 34 extra real non-promoters by literature searching as an independent testing set (IND); see also the Supplementary Information. As observed (Figure 3F and K), they have a Pribnow Box, TSS and GC content similar to those of promoters, although they have no transcription function. Such a specific similarity between promoters and non-promoters poses a challenge to promoter recognition.
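For illustration, the sketch below pairs the promoters with one generated negative group (e.g. the phantom group for DB4) and indexes the combined data for 10-fold cross-validation; the variable names are illustrative.

```python
# Illustrative sketch: assemble one dataset (promoters + one negative group)
# and index it for 10-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def build_dataset(promoters, non_promoters):
    seqs = list(promoters) + list(non_promoters)
    labels = np.array([1] * len(promoters) + [0] * len(non_promoters))
    return seqs, labels

phantom_negatives = [phantom_sample(p) for p in promoters]    # e.g. the DB4 negative group
seqs, labels = build_dataset(promoters, phantom_negatives)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(cv.split(seqs, labels)):
    ...  # train the model on train_idx, evaluate on test_idx
```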

Model architecture
SiamProm is an end-to-end framework containing an elaborate Siamese network and a binary predictor. The former contains two identical sub-networks, which embed the individual members of sequence pairs in parallel. If both sequences in a pair are promoters or both are non-promoters, the pair is labeled as a contrastively positive sample; if the pair comprises a promoter and a non-promoter, it is considered a contrastively negative sample. The Siamese network learns sequence representations by distinguishing contrastively positive pairs from contrastively negative pairs as well as possible. Each sub-network comprises four modules: an embedding initializer, a k-mer attention module, a bi-directional context catcher and a nearest-neighbor aggregator (Figure 2). The embedding initializer splits promoter sequences into a set of k-mer tokens and initializes them as one-hot encodings; it then passes the encodings through a fully connected layer and integrates them with token positional encodings. The attention module, containing a multi-head attention layer with residual connections, captures the globally important associations between pairwise k-mer tokens. The bi-directional context catcher, implemented by a bi-LSTM, encodes the upstream and downstream contexts of tokens, simultaneously characterizing the forward strand and the implicit reverse strand of DNA sequences. The nearest-neighbor aggregator, implemented by a 1D-convolution layer with residual connections, encodes the associations between neighboring tokens. In addition, each module except the embedding initializer is followed by an adapter (i.e. an MLP) to align token embeddings, which are further pooled (i.e. average-pooling) to obtain sequence embeddings. The final sequence embedding is obtained by concatenating the sequence embeddings derived from the attention module, the context catcher and the nearest-neighbor aggregator. For fast training, a compressor (i.e. an MLP) is placed after the embedding concatenation for dimensionality reduction. Finally, the promoter predictor (i.e. an MLP) is trained to distinguish promoters from non-promoters. During training, the parameters of the representation learning module are frozen after updating, and only the predictor module continues to update its parameters. The mathematical details and parameter settings of SiamProm are given in the supplementary file.
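The condensed PyTorch sketch below mirrors one Siamese sub-network in the module order described above; all dimensions, the learned positional embedding and the single-layer adapters are simplifying assumptions rather than the published hyperparameters.

```python
import torch
import torch.nn as nn

class SiamSubNet(nn.Module):
    """One branch of the Siamese encoder (dimensions are illustrative, not the paper's settings)."""
    def __init__(self, vocab_size, seq_len, d_model=64, n_heads=4, d_out=32):
        super().__init__()
        # Embedding initializer: one-hot followed by a linear layer is equivalent to nn.Embedding
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)          # learned positional encodings (assumed)
        # k-mer attention module with a residual connection
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Bi-directional context catcher
        self.bilstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        # Nearest-neighbor aggregator (1-D convolution over tokens) with a residual connection
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # Adapters aligning each module's token embeddings before average pooling
        self.adapters = nn.ModuleList([nn.Linear(d_model, d_out) for _ in range(3)])
        # Compressor applied to the concatenated sequence embedding
        self.compressor = nn.Linear(3 * d_out, d_out)

    def forward(self, tokens):                                  # tokens: (batch, seq_len) k-mer ids
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h0 = self.token_emb(tokens) + self.pos_emb(pos)         # (batch, seq_len, d_model)
        h1 = h0 + self.attn(h0, h0, h0)[0]                      # global k-mer associations
        h2, _ = self.bilstm(h1)                                 # upstream/downstream contexts
        h3 = h2 + self.conv(h2.transpose(1, 2)).transpose(1, 2) # neighboring-token associations
        pooled = [a(h).mean(dim=1) for a, h in zip(self.adapters, (h1, h2, h3))]  # average pooling
        return self.compressor(torch.cat(pooled, dim=-1))       # final sequence embedding
```

In the full framework, the two branches share these weights; a contrastive loss on paired embeddings and an MLP predictor on single-sequence embeddings, as described above, complete the model.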

Prediction measure
To measure prediction performance, this work adopts four commonly used classification metrics: accuracy (Acc), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC). They are defined as follows:

Acc = (TP + TN) / (TP + TN + FP + FN)
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Here, TP (true positive) is the count of accurately classified promoters; TN (true negative) is the count of accurately classified non-promoters; FN (false negative) is the count of promoters erroneously classified as non-promoters; and FP (false positive) is the count of non-promoters mistakenly classified as promoters. Sn is also known as recall or the true positive rate, while Sp is known as the true negative rate. MCC is a correlation coefficient quantifying the global agreement between true and predicted labels.
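For convenience, the four metrics can be computed from the confusion-matrix counts with a small helper such as the one below.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Acc, Sn, Sp and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                       # recall / true positive rate
    sp = tn / (tn + fp)                       # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc
```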

Non-promoter generation analysis
The differences among non-promoters generated by the four strategies were investigated. First, the well-known motifs, including the −10 Pribnow Box and the TSS, were illustrated by SeqLogo [27] (Figure 3A-F). The illustration shows that an expected significant motif, 'TAAAAT', lies inside the −10 Pribnow box, and a TSS motif is found in promoters. In contrast, neither random generation nor CDS sampling shows these motifs (i.e. the base distributions are disordered), while partial substitution shows them only weakly. Remarkably, the non-promoters generated by our phantom sampling contain motifs similar to those of promoters. Moreover, the real non-promoters determined by biological assays exhibit similar motifs as well.
Then, the GC contents of generated non-promoters and real non-promoters were counted and compared with those of promoters. The statistical difference between the promoter and non-promoter distributions was measured by the Wasserstein distance (d_w) [28]; the smaller the distance, the more similar the promoters and non-promoters are in terms of GC content. The comparison shows that the non-promoters generated by random generation and partial substitution differ significantly from promoters (i.e. d_w^rand = 11.14 and d_w^part = 7.52, respectively). In addition, those generated by CDS sampling are moderately different from promoters (i.e. d_w^CDS = 3.73). In contrast, both the non-promoters generated by our phantom sampling and the real non-promoters show GC-content distributions similar to those of promoters (i.e. d_w^phan = 1.85 and d_w^real = 1.00, respectively).
The investigation demonstrates that traditional non-promoter generation strategies imply, a priori, a significant difference between promoters and non-promoters. This artificial difference results in over-optimistic promoter predictions. However, real non-promoters show a similar Pribnow box and TSS, as observed. Thus, existing non-promoter generation strategies would not achieve good predictions in real scenarios. Furthermore, since the non-promoters generated by our phantom sampling are similar to real non-promoters, they help build a fairer, less biased predictor that focuses on finding unobserved motifs. See the experimental validations in the next section.
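A minimal sketch of this comparison, using SciPy's one-dimensional Wasserstein distance on per-sequence GC percentages (the percent scale and the variable names are assumptions):

```python
# Sketch: compare GC-content distributions of promoters and a negative group
# via the 1-D Wasserstein distance (GC content expressed in percent).
from scipy.stats import wasserstein_distance

def gc_percent(seq):
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

prom_gc = [gc_percent(s) for s in promoters]
neg_gc = [gc_percent(s) for s in phantom_negatives]   # or any other negative group
d_w = wasserstein_distance(prom_gc, neg_gc)           # smaller = more similar distributions
```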

Method comparison over non-promoter generation strategies
To assess the performance of the prediction methods w.r.t. non-promoter generation, we compared our SiamProm with four baseline methods, including one ML-based method, two DL-based methods [12, 14] and one ensemble learning method [13]. They are briefly introduced in the following.
(i) We designed an SVM-based method as the baseline. It was trained on the concatenation of two popular hand-crafted features, 3-mer composition (64-dimensional) and GC percentage (1-dimensional); a featurization sketch is given after this method list. We implemented the SVM with the radial basis function as the kernel, a penalty factor of 1.0 and a maximum of 1000 iterations.
(ii) DeePromoter is an end-to-end DL-based model. It first uses a set of CNNs with size-varied convolutional kernels to characterize nucleotide base neighborhoods, then utilizes a Bi-LSTM to capture the long-distance associations between nucleotide bases, and finally builds an MLP as the predictor. We retrained it on our datasets.
(iii) CyaPromBERT, a recently published DL-based method, utilizes BERT [29] to pre-train on genomes and fine-tunes the pre-trained model on cyanobacteria promoters as the predictor. Owing to its inner transformer encoders, it can capture contextual information in the promoter sequence. We fine-tuned its pre-trained module on our datasets.
(iv) iPro-WAEL is an ensemble learning-based model composed of CNNs and random forests. Its CNNs utilize word embedding techniques to extract sequence features, while its random forests utilize traditional sequence-based descriptors as features. We retrained it on our datasets.
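The sketch below illustrates the baseline featurization described in item (i): 64-dimensional 3-mer frequencies plus GC percentage fed into an RBF-kernel SVM with C = 1.0 and at most 1000 iterations; the frequency normalization is an assumption.

```python
# Sketch of the SVM baseline features: 3-mer frequencies (64-d) + GC percentage (1-d).
from itertools import product
import numpy as np
from sklearn.svm import SVC

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
KMER_IDX = {k: i for i, k in enumerate(KMERS)}

def featurize(seq):
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 2):
        idx = KMER_IDX.get(seq[i:i + 3])
        if idx is not None:               # skip 3-mers with ambiguous bases
            counts[idx] += 1
    counts /= max(counts.sum(), 1)        # 3-mer frequencies (normalization assumed)
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return np.append(counts, gc)          # 65-dimensional feature vector

X = np.stack([featurize(s) for s in seqs])          # seqs/labels from the dataset sketch above
clf = SVC(kernel="rbf", C=1.0, max_iter=1000).fit(X, labels)
```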
The comparisons of our SiamProm with the four baseline methods were run over the four non-promoter generation strategies mentioned above. Under all the generation strategies, the DL-based methods (i.e. SiamProm, iPro-WAEL, CyaPromBERT and DeePromoter) outperform the ML-based method (i.e. the SVM-based one), since the latter relies on hand-crafted sequence features, which are inadequate to characterize the difference between promoters and non-promoters. Specifically, our SiamProm achieves the best prediction overall; the prediction performance of iPro-WAEL, CyaPromBERT and DeePromoter is in descending order; and the basic SVM-based method gives the worst prediction metrics. The comparison demonstrates that SiamProm is more robust to various sampling methods because its Siamese network-based encoding can capture a more powerful representation of the difference between promoters and non-promoters.
On the other hand, the comparison between generation strategies shows that both random generation and CDS sampling give over-optimistic results, where even the SVM-based method can reach >90% accuracy with simple features (GC% and 3-mer), due to the pre-existing factitious difference between promoters and generated non-promoters in these scenarios.
Moreover, the prediction performance of all the methods decreases under partial substitution and phantom sampling, because the generated non-promoters contain the Pribnow box and the TSS, which reduces the factitious difference. Note that our phantom generation reduces the factitious difference better than partial substitution. More results can be found in Table 3 and Figure 4 in the next section.

Performance on real non-promoters
To further investigate which non-promoter generation strategy is appropriate in real scenarios, the real non-promoters (Table 3) were input into the models built under random generation, CDS sampling, partial substitution and our phantom generation, respectively. As shown in Table 3, the popular generation strategies give rise to poor predictions for all the existing models. In detail, the basic SVM-based model achieves poor predictions under random generation, CDS sampling and partial substitution. DeePromoter achieves poor prediction (<30% accuracy) under both random generation and CDS sampling but gives a moderate prediction (i.e. 67% accuracy) in the case of partial substitution, where certain Pribnow boxes and TSSs can be found in the non-promoters. In addition, it is surprising that CyaPromBERT behaves like a random guess in all the scenarios (i.e. 50% accuracy). The poor performance of CyaPromBERT may be because the fine-tuning data differ from the pre-training data, making it more difficult to identify real non-promoters. iPro-WAEL achieves 53% and 62% accuracy under random generation and CDS sampling, while giving a moderate prediction under partial substitution (i.e. 70% accuracy). In short, although inspiring performance is achieved under cross-validation, the three popular strategies of non-promoter generation result in dramatically reduced performance on real non-promoters.
In contrast, our phantom sampling leads to significantly improved predictions for all the models. Compared with the other generation strategies, phantom generation helps the basic SVM-based model achieve a moderate prediction (61.67%) with a 17.55%-44.02% improvement in accuracy. Similarly, it helps DeePromoter achieve 73.52% accuracy with a 5.88%-50.00% improvement. In addition, iPro-WAEL achieves 76.47% accuracy with a 5.89%-23.53% improvement. It also helps our SiamProm achieve the best prediction (85.29% accuracy) with a 2.94%-26.47% improvement. The results demonstrate that our phantom generation is more appropriate for real scenarios.
On the other hand, our model can improve the prediction significantly over all the non-promoter generation strategies. In detail, it achieves a 5.88%-41.17% improvement under random generation, a 2.94%-44.11% improvement under CDS sampling, a 5.89%-32.35% improvement under partial substitution and an 11.76%-26.56% improvement under phantom generation. Especially under phantom generation, the performance of SiamProm on real non-promoters (88.23% in Table 3) is consistent with the performance on generated non-promoters (90.30% in Table 2). The comparison validates that our SiamProm significantly outperforms the existing methods.
Furthermore, we illustrated how well promoters can be distinguished from non-promoters in the latent space. Using t-distributed stochastic neighbor embedding (t-SNE) [30], we mapped the sequence representation vectors generated by the SVM, CyaPromBERT, DeePromoter and SiamProm under the phantom generation strategy into 2-dimensional latent spaces, respectively (Figure 4). These latent spaces show that our SiamProm has learned the difference between promoters (yellow) and non-promoters, including the generated non-promoters (blue) and the real non-promoters (red). The illustration demonstrates that our phantom generation can produce more appropriate non-promoters, which are more similar to real non-promoters. More importantly, the prediction model must learn other hidden sequence motifs because the non-promoters show Pribnow boxes, TSS regions and GC contents similar to those of promoters (Figure 3F and K). See how SiamProm finds a novel motif in the Motif Analysis section.
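A minimal sketch of such a projection with scikit-learn's t-SNE, assuming `embeddings` is an (n, d) NumPy array of learned sequence representations and `groups` is a matching array of per-sample labels:

```python
# Sketch: project learned sequence embeddings to 2-D with t-SNE and color by group.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
for group, color, marker in [("promoter", "gold", "o"),
                             ("generated non-promoter", "blue", "o"),
                             ("real non-promoter", "red", "^")]:
    mask = groups == group
    plt.scatter(emb_2d[mask, 0], emb_2d[mask, 1], c=color, marker=marker, s=8, label=group)
plt.legend()
plt.show()
```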

Ablation study
In this section, we explored the contributions of crucial modules of SiamProm to the recognition of promoters through ablation studies.
Four SiamProm variants were constructed, each masking one module in the representation learning stage. The first removed the whole Siamese network (denoted as w/o Siamese). The second removed the attention module in the Siamese network (denoted as w/o attention). The third removed the bi-directional context catcher (denoted as w/o BCC). The last removed the nearest-neighbor aggregator (denoted as w/o NNA). The experiments were run under 10-fold cross-validation and an independent test on real non-promoters, respectively. The comparison reveals that SiamProm significantly outperforms the other variants (Table 4). The results demonstrate that the Siamese network is the most crucial module in our model. The underlying reason is that it can capture the difference between promoters and non-promoters: its inner contrastive learning pulls within-class samples closer and pushes between-class samples farther apart in the feature space. Moreover, the three other modules in the Siamese network play non-trivial roles in sequence embedding. Precisely, the k-mer attention module captures the globally important associations between pairwise k-mer tokens; the bi-directional context catcher encodes the upstream and downstream contexts of sequences; and the nearest-neighbor aggregator encodes the associations between neighboring tokens to capture the local features of sequences.
In general, all these modules contribute to sequence representation, resulting in a good prediction.

Motif analysis
This section illustrates that our SiamProm can discover a novel potential motif among cyanobacterial promoter sequences. We leveraged the self-attention module of SiamProm (formula 4) to highlight the important k-mer token pairs. Technically, the attention scores of all promoter sequences w.r.t. k-mer tokens were averaged and normalized to obtain the global attention matrix (Figure 5A), where columns represent upstream tokens and rows represent downstream tokens. As we observed, higher self-attention values between two tokens generally correspond to tokens that lie closer together along the sequence. Thus, a high attention score between two tokens possibly implies a longer consecutive token. For example, the high attention score between 'CGA' and 'TCG' reflects the token 'CGATCG'.
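A sketch of how such a global attention matrix could be assembled, assuming per-sequence attention weights from the k-mer attention module and the 3-mer index of each position are available; the aggregation and normalization details are our reading of the description.

```python
# Sketch: average per-sequence attention maps into a global 3-mer x 3-mer matrix,
# then list high-scoring token pairs (input shapes and the position->token mapping are assumed).
import numpy as np

n_tokens = 64                                              # all 3-mers over ACGT
global_attn = np.zeros((n_tokens, n_tokens))
counts = np.zeros((n_tokens, n_tokens))

for attn, token_ids in zip(attention_maps, tokenized_promoters):
    # attn: (seq_len, seq_len) attention weights; token_ids: 3-mer index at each position
    for i, ti in enumerate(token_ids):                     # upstream token (column in Figure 5A)
        for j, tj in enumerate(token_ids):                 # downstream token (row)
            global_attn[tj, ti] += attn[i, j]
            counts[tj, ti] += 1

global_attn = np.divide(global_attn, counts, out=np.zeros_like(global_attn), where=counts > 0)
global_attn /= global_attn.max()                           # normalize to [0, 1]
high_pairs = np.argwhere(global_attn >= 0.9)               # high-attention 3-mer pairs
```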
We further counted the occurrences of all possible 6-mer tokens, and the above 10 tokens are ranked exactly as the top 10 (Figure 5C). In detail, the result shows that the number of occurrences of 'TAAAAT' is the largest among all the tokens, and the tokens of Rank 5 to Rank 10 are similar to 'TAAAAT', consisting only of adenine (A) and thymine (T). After investigating their occurring positions (Supplementary Figure S1A-J), we found that such A/T-rich tokens correspond to the significant −10 motif of cyanobacterial promoters [31]. Moreover, it is remarkable that three novel tokens, 'CGATCG', 'GATCGC' and 'GCGATC', are ranked second, third and fourth, respectively, in terms of the number of occurrences. According to their positions, we found that they always occur in an overlapping way. This interesting finding implies a possible longer motif, which we examined via 4-mer token pairs.
In a similar manner, we found that several 4-mer token pairs have high attention scores (≥0.9), including 'GCGA-TCGC', 'AGCG-ATCG', 'CGAT-CGCT' and 'CGAT-CGCC' (Figure 5B). After counting the occurrences of their composite 8-mer tokens (Figure 5D), the top-10 ranking surprisingly reveals a possible palindromic 8-mer token, 'GCGATCGC'. In detail, the number of occurrences of 'GCGATCGC' is ranked first among all the tokens and is nearly seven times that of the Rank-10 token 'TAAAATTA' (i.e. the extension of the well-known −10 motif). More importantly, the tokens of Rank 2 to Rank 9 are highly similar to 'GCGATCGC', and their longest common sub-token, 'CGATCG', determined by multiple sequence alignment, is ranked exactly second in terms of 6-mer occurrences. Furthermore, by investigating their occurring positions (Supplementary Figure S2A-H), we found that this palindromic 8-mer token occurs at almost any position along the promoter sequence but is excluded from the −10 position (Figure 5E and F). Therefore, we believe the palindromic token is a potential promoter motif with sequence conservation and positional variance.
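The counting itself reduces to simple k-mer bookkeeping; a minimal sketch (with `promoters` as the list of 81-bp sequences) is given below.

```python
# Sketch: rank k-mer occurrences across all promoters and record where a token occurs.
from collections import Counter

def kmer_counts(seqs, k):
    c = Counter()
    for s in seqs:
        c.update(s[i:i + k] for i in range(len(s) - k + 1))
    return c

top10_6mers = kmer_counts(promoters, 6).most_common(10)
top10_8mers = kmer_counts(promoters, 8).most_common(10)

def positions(seqs, token):
    """Start positions (relative to -60) of `token` in each sequence."""
    return [i - 60 for s in seqs for i in range(len(s) - len(token) + 1)
            if s[i:i + len(token)] == token]

gcgatcgc_positions = positions(promoters, "GCGATCGC")
```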
Moreover, similar observations of palindromic tokens have been reported in the literature. For example, Xu et al. [32] showed that the restriction endonuclease MspI recognizes 5'-CCGG. Al-Attar et al. [33] found that clustered regularly interspaced short palindromic repeats (CRISPRs) are the hallmark of an ingenious antiviral defense mechanism in prokaryotes. Deng et al. [34] synthesized repetitive extragenic palindromic (REP) sequences in E. coli as an efficient mRNA stabilizer for protein production and metabolic engineering. Li et al. [35] discovered that the palindromic sequence Div2 (GTAAACATGTTTAC) can bind to specific transcription factors to regulate gene transcription, thereby affecting gene expression levels. Inspired by the above findings, we believe that the motif 'GCGATCGC' in cyanobacterial promoters has a certain function in promoter recognition. More biological assays are needed to further validate its functions.

DISCUSSION
Despite the crucial role of cyanobacteria promoters, their identification is more difficult than that of coding regions [36]. The underlying reason is that, compared with other well-studied bacteria (e.g. E. coli or B. subtilis), cyanobacteria contain unknown motifs in addition to the well-known bacterial motif (i.e. the Pribnow Box in the −10 region). In addition, some current works focus on detecting regulatory elements and structural units of chromatin [37][38][39], which helps researchers understand the function of species' genes more comprehensively.
To assist biological assays, both ML-based and DL-based methods train a promoter recognizer (classifier) on known promoters and non-promoters. In general, the non-promoters are generated by various generation strategies due to the lack of real non-promoters. Aiming to eliminate the factitious difference between promoters and generated non-promoters, this work proposes a novel non-promoter generation strategy, phantom sampling. Phantom sampling forces the prediction model to find hidden novel motifs, different from the well-known motifs, in other regions.
To dig out such hidden motifs, this work elaborates a novel promoter prediction model based on the Siamese network (SiamProm), which jointly characterizes sequences by simultaneously capturing global associations, upstream and downstream contexts, and neighboring associations w.r.t. k-mer tokens.
Moreover, the Siamese network amplifies the hidden difference between promoters and non-promoters by pulling samples of the same class closer together while pushing promoters and non-promoters apart. The comparison with other state-of-the-art methods in both classification and feature visualization demonstrates the superiority of SiamProm.
More importantly, our SiamProm, upon phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC'), whose number of occurrences is significantly larger than that of the Pribnow box. It also shows interesting properties: it is palindrome-patterned, content-conserved, but position-shifted. We believe that SiamProm provides new insights into the structure of cyanobacteria promoters and can also be transferred to promoters of other species. In the future, the function of the palindromic motif 'GCGATCGC' should be investigated by biological assays. Moreover, this new structural information about cyanobacteria promoters can be helpful for novel promoter design in synthetic biology.
In addition, this study still has some limitations that are currently difficult to resolve. In genomics, the biological functions of cyanobacteria are regulated not only by promoters; detecting the regulatory elements and structural units of chromatin is also important. In future research, we will explore more of the intrinsic connections between other functional elements and promoters to uncover the mechanisms of cyanobacterial functions.

Key Points
• Phantom sampling generates more realistic non-promoters, whose GC contents and Pribnow boxes look more like those of promoters. Such non-promoters enable prediction models to capture novel sequence features or potential motifs of cyanobacterial promoters.
• SiamProm leverages contrastive learning to obtain discriminative sequence representations of promoters and non-promoters. Specifically, its k-mer attention module captures the globally important associations between pairwise k-mer tokens; its bi-directional context catcher encodes the upstream and downstream contexts of sequences; and its nearest-neighbor aggregator encodes the associations between neighboring tokens to capture the local features of sequences.
• SiamProm, upon phantom sampling, helps find a novel motif ('GCGATCGC') among cyanobacterial promoters. This motif is palindrome-patterned, content-conserved, but position-shifted.

Figure 2.
Figure 2. The architecture of SiamProm. The dataset construction module takes a pair of sequences as input, and the inputs include their individual labels and the contrastive label. The embedding initializer splits sequences into a set of k-mer tokens and processes them into token embeddings with positional encodings. The k-mer attention module captures the globally important associations between pairwise k-mer tokens. The bi-directional context catcher encodes the upstream and downstream contexts of sequences. The nearest-neighbor aggregator encodes the associations between neighboring tokens to capture the local features of sequences. The predictor takes the representation of a sequence and distinguishes whether it is a promoter or a non-promoter.

Figure 3.
Figure 3. The results and analysis of non-promoter generation. (A-F) SeqLogo maps generated from promoters and different non-promoters; they display the distribution of bases at different positions in the sequences. (G-K) GC content distributions of promoters and non-promoters.

Figure 4.
Figure 4. Latent spaces built by five models under phantom sampling. Only 2000 promoters and generated non-promoters are shown for clearer illustration. Real non-promoters are highlighted by red triangles.

Figure 5.
Figure 5. Motif analysis of promoters. (A) 3-mer attention map of promoters. We filtered out 3-mer pairs with attention scores below 0.9 and display exact scores only above a threshold of 0.92. The highlighted tick labels show the 3-mer pairs of Rank 2 to Rank 4. (B) 4-mer attention map of promoters. We filtered out 4-mer pairs with attention scores below 0.9 and display exact scores only above a threshold of 0.95. The highlighted tick labels show the Rank-1 4-mer pair. (C) Top 10 6-mer fragments. (D) Top 10 8-mer fragments. Y-axis tick labels show a simple sequence alignment of the top 9 8-mer fragments. (E) Position distribution of the 8-mer GCGATCGC. (F) Position distribution of the cyanobacterial 8-mer TAAAATTA.

Table 2:
Performance comparison of five models with four negative generation methods on test datasets. Note: Boldface values represent the best values of the metrics under the corresponding sampling.

Table 3:
Performance on real non-promoters. Note: Boldface values represent the best values of the metrics under the corresponding sampling. Since all the real non-promoters are negative samples, the metric values in the table are Acc (%) or, equivalently, Sp (%).

Table 4:
Ablation comparison. Note: Boldface values represent the best values of the metrics under phantom sampling. Acc* represents the accuracy (Acc%) or specificity (Sp%) of the model on real non-promoters.